Abstract

We consider the problem of universal decoding for arbitrary unknown channels in the random 
coding regime. For a given random coding distribution and a given class of metric decoders, we 
propose a generic universal decoder whose average error probability is, within a sub-exponential 
multiplicative factor, no larger than that of the best decoder within this class of decoders. Since 
the optimum, maximum likelihood (ML) decoder of the underlying channel is not necessarily 
assumed to belong to the given class of decoders, this setting suggests a common generalized 
framework for: (i) mismatched decoding, (ii) universal decoding for a given family of channels, 
and (iii) universal coding and decoding for deterministic channels using the individual-sequence 
approach. The proof of our universality result is fairly simple, and it is demonstrated how some 
earlier results on universal decoding are obtained as special cases. We also demonstrate how 
our method extends to more complicated scenarios, like incorporation of noiseless feedback, and 
the multiple access channel. 

Index Terms: Universal decoding, mismatched decoding, error exponents, finite-state machines, 
Lempel-Ziv algorithm, feedback, multiple access channel. 
1 Introduction



In many situations practically encountered in coded communication systems, channel uncertainty 
and variability preclude the implementation of the optimum maximum likelihood (ML) decoder, 
and so, universal decoders, independent of the unknown channel, are sought. 

The topic of universal coding and decoding under channel uncertainty has received considerable
attention over the last four decades. In [7], Goppa offered the maximum mutual information (MMI)
decoder, which decides in favor of the codeword having the maximum empirical mutual information 
with the channel output sequence. Goppa showed that for discrete memoryless channels (DMC's), 
MMI decoding achieves capacity. Csiszár and Körner [3] have also studied the problem of universal
decoding for DMC's with finite input and output alphabets. They showed that the random coding
error exponent of the MMI decoder, associated with a uniform random coding distribution over a
certain type class, achieves the optimum random coding error exponent. Csiszár [2] proved that
for any modulo-additive DMC and the uniform random coding distribution over linear codes, the 
optimum random coding error exponent is universally achieved by a decoder that minimizes the 
empirical entropy of the difference between the output sequence and the input sequence. In [13] an 
analogous result was derived for a certain parametric class of memoryless Gaussian channels with 
an unknown interference signal. 

In the realm of channels with memory, Ziv [21] explored the universal decoding problem for 
unknown finite-state channels with finite input and output alphabets, for which the next channel 
state is a deterministic unknown function (a.k.a. the next-state function) of the channel current 
state and current inputs and outputs. For codes governed by uniform random coding over a given 
set, he proved that a decoder based on the Lempel-Ziv algorithm asymptotically achieves the error 
exponent associated with ML decoding. In [9], Lapidoth and Ziv proved that the latter decoder 
continues to be universally asymptotically optimum in the random coding error exponent sense even 
for a wider class of finite-state channels, namely, those with stochastic, rather than deterministic, 
next-state functions. In [5], Feder and Lapidoth furnished sufficient conditions for families of
channels with memory to have universal decoders that asymptotically achieve the random coding 
error exponent associated with ML decoding. In [6], Feder and Merhav proposed a competitive 
minimax criterion, in an effort to develop a more general systematic approach to the problem of 
universal decoding. According to this approach, an optimum decoder is sought in the quest for
minimizing (over all decision rules) the maximum (over all channels in the family) ratio between the
error probability associated with a given channel and a given decision rule, and the error probability
of the ML decoder for that channel, possibly raised to some power less than unity.

More recently, interesting attempts (see, e.g., [11], [12], [16], [18]) were made to devise coding 
and decoding strategies that avoid any probabilistic assumptions concerning the operation of the 
channel. This is in the spirit of the individual-sequence approach in information theory, that was 
originally developed in universal source coding [22] and later on further exercised in other problem 
areas. In [11], the notion of empirical rate functions has been established and investigated (with 
and without feedback) for a given input distribution and for a given posterior probability function
(or a family of such functions) of the channel input sequence given the output sequence. In [16], 
capacity-achieving (or "porosity-achieving", in the terminology of [16]) universal encoders and 
decoders, namely, encoder-decoder pairs with coding rates as high as the best finite-state encoder 
and decoder, were devised for modulo-additive channels with deterministic noise sequences and
noiseless feedback. This feedback is necessary to let the encoder adapt to the channel, which 
otherwise does not access the channel output and thus cannot learn (either implicitly or explicitly) 
the characteristics of the channel. 

In this paper, we take a somewhat different approach. We consider the problem of universal
decoding for arbitrary unknown channels in the random coding regime. For a given random coding 
distribution and a given class of metric decoders, we propose a generic universal decoder whose 
average error probability is, within a sub-exponential multiplicative factor, no larger than that 
of the best decoder in this class of decoders. Since the optimum, ML decoder of the underlying 
channel is not necessarily assumed to belong to the given class of decoders, this setting is suitable 
as a common ground for: 

1. Mismatched decoding (see, e.g., [4], [8], [15]) - when the reference class of decoders is a 
singleton and the ML decoder for the underlying channel is different from the unique decoder 
in this singleton. 

2. Universal decoding for a given family of channels (as in papers cited in the second and third 
paragraphs above) - when the ML decoder for the underlying channel belongs to the given 
class of decoders.

3. Universal coding and decoding for deterministic channels using the individual-sequence ap- 
proach (as in [11], [12], [16], [18]) - when the underlying channel is deterministic and the 
universality is relative to a given class of coding/decoding strategies. 

The proof of our universality result is fairly simple and general, and it is demonstrated how some 
earlier mentioned results on universal decoding are obtained as special cases. It is based on very 
simple upper and lower bounds on the probabilities of the pairwise error events, as well as on a lower 
bound due to Shulman [19, Lemma A.2] on the probability of the union of pairwise independent
events, which coincides with the union bound up to a factor of 1/2. 

Finally, we demonstrate how our method extends to more complicated scenarios. The first 
extension corresponds to random coding distributions that incorporate noiseless feedback.
This extension is fairly straightforward, but its main importance is in allowing adaptation of the 
random coding distribution to the channel statistical characteristics. The second extension is to the 
problem of universal decoding for multiple access channels (MAC's) with respect to a given class of 
decoding metrics. This extension is not trivial since the universal decoding metric has to confront 
three different types of error events (in the case of a MAC with two senders). In particular, it turns 
out that the resulting universal decoding metric is surprisingly different from those of earlier works 
on universal decoding for the MAC [10], [5, Section VIII], [17], mostly because the problem setting 
here is different from those of these earlier works (in the sense that the universality here is relative 
to a given class of decoders while the underlying channel is arbitrary, and not relative to a given 
class of channels). 

The outline of the paper is as follows. In Section 2, we establish notation conventions and 
we formalize the problem setting. Section 3 contains our main result and its proof, as well as a 
discussion and examples. Section 4 suggests guidelines for approximating the universal decoding 
metric in situations where it is hard to compute, and thereby shows how Ziv's decoding metric [21] 
falls within our framework. Finally, in Section 5, we provide extensions to the case where feedback 
is available and the case of the MAC. 
2 Notation Conventions and Problem Formulation

2.1 Notation Conventions 

Throughout this paper, scalar random variables (RV's) are denoted by capital letters, their sample
values are denoted by the respective lower case letters, and their alphabets are denoted by the
respective calligraphic letters. A similar convention applies to random vectors of dimension $n$ and
their sample values, which will be denoted with the same symbols in the bold face font. The set of
all $n$-vectors with components taking values in a certain alphabet will be denoted as the same
alphabet superscripted by $n$. Sources and channels will be denoted generically by the letter $P$ or $Q$.
For example, the channel input probability distribution function will be denoted by $Q(\mathbf{x})$,
$\mathbf{x} \in \mathcal{X}^n$, and the conditional probability distribution of the channel output vector
$\mathbf{y} \in \mathcal{Y}^n$ given the input vector $\mathbf{x} \in \mathcal{X}^n$ will be denoted by
$P(\mathbf{y}|\mathbf{x})$. Information theoretic quantities like entropies and conditional entropies
will be denoted following the standard conventions of the information theory literature, e.g.,
$H(X)$, $H(X|Y)$, etc. The expectation operator will be denoted by $E\{\cdot\}$ and the
cardinality of a finite set $\mathcal{A}$ will be denoted by $|\mathcal{A}|$.

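For a given sequence $\mathbf{x} \in \mathcal{X}^n$, $\mathcal{X}$ being a finite alphabet,
$P_{\mathbf{x}}$ denotes the empirical distribution on $\mathcal{X}$ extracted from $\mathbf{x}$; in
other words, $P_{\mathbf{x}}$ is the vector $\{P_{\mathbf{x}}(x),\ x \in \mathcal{X}\}$, where
$P_{\mathbf{x}}(x)$ is the relative frequency of the letter $x$ in the vector $\mathbf{x}$. The type
class of $\mathbf{x}$, denoted $T_{\mathbf{x}}$, is the set of all sequences
$\mathbf{x}' \in \mathcal{X}^n$ with $P_{\mathbf{x}'} = P_{\mathbf{x}}$. Similarly, for a pair of
sequences $(\mathbf{x},\mathbf{y}) \in \mathcal{X}^n \times \mathcal{Y}^n$, the empirical distribution
$P_{\mathbf{x}\mathbf{y}}$ is the matrix of relative frequencies
$\{P_{\mathbf{x}\mathbf{y}}(x,y),\ x \in \mathcal{X},\ y \in \mathcal{Y}\}$ and the type class
$T_{\mathbf{x}\mathbf{y}}$ is the set of pairs
$(\mathbf{x}',\mathbf{y}') \in \mathcal{X}^n \times \mathcal{Y}^n$ with
$P_{\mathbf{x}'\mathbf{y}'} = P_{\mathbf{x}\mathbf{y}}$. For a given $\mathbf{y}$,
$T_{\mathbf{x}|\mathbf{y}}$ denotes the conditional type class of $\mathbf{x}$ given $\mathbf{y}$,
which is the set of vectors $\{\mathbf{x}'\}$ such that
$(\mathbf{x}',\mathbf{y}) \in T_{\mathbf{x}\mathbf{y}}$. Information measures induced by empirical
distributions, i.e., empirical information measures, will be denoted with a hat and a subscript that
indicates the sequence(s) from which they are induced. For example, $\hat{H}_{\mathbf{x}}(X)$ is the
empirical entropy extracted from $\mathbf{x} \in \mathcal{X}^n$, namely, the entropy of a random
variable $X$ whose distribution is $P_{\mathbf{x}}$. Similarly,
$\hat{H}_{\mathbf{x}\mathbf{y}}(X|Y)$ and $\hat{I}_{\mathbf{x}\mathbf{y}}(X;Y)$ are, respectively, the
empirical conditional entropy of $X$ given $Y$, and the empirical mutual information between $X$ and
$Y$, extracted from $(\mathbf{x},\mathbf{y})$, and so on.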
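As a minimal illustrative sketch (not part of the original text), the empirical information measures defined above can be computed directly from letter counts. The function names below are hypothetical, and base 2 logarithms are used, in line with the convention of this paper.

```python
from collections import Counter
from math import log2


def empirical_entropy(seq):
    """Entropy (in bits) of the empirical distribution of the sequence seq."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def empirical_conditional_entropy(x, y):
    """Empirical H(X|Y) = H(X,Y) - H(Y), computed from joint letter counts."""
    return empirical_entropy(list(zip(x, y))) - empirical_entropy(y)


def empirical_mutual_information(x, y):
    """Empirical I(X;Y) = H(X) - H(X|Y)."""
    return empirical_entropy(x) - empirical_conditional_entropy(x, y)


x = [0, 0, 1, 1]
y = [0, 0, 1, 1]   # y determines x
z = [0, 1, 0, 1]   # empirically independent of x
print(empirical_entropy(x))                # 1.0
print(empirical_mutual_information(x, y))  # 1.0
print(empirical_mutual_information(x, z))  # 0.0
```

Two sequences belong to the same type class exactly when their letter counts coincide, so `Counter(x) == Counter(x_prime)` is a direct membership test for $T_{\mathbf{x}}$.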

For two sequences of positive numbers, $\{a_n\}$ and $\{b_n\}$, the notation $a_n \doteq b_n$ means that
$\frac{1}{n}\log\frac{a_n}{b_n} \to 0$ as $n \to \infty$. Similarly, $a_n \stackrel{\cdot}{\le} b_n$
means that $\limsup_{n\to\infty} \frac{1}{n}\log\frac{a_n}{b_n} \le 0$, and so on.
The functions $\log(\cdot)$ and $\exp(\cdot)$, throughout this paper, will be defined to the base 2,
unless otherwise indicated. The operation $[\cdot]_+$ will mean positive clipping, that is,
$[x]_+ = \max\{0,x\}$.

2.2 Problem Formulation 

Consider a random selection of a codebook
$\mathcal{C} = \{\mathbf{x}_1,\ldots,\mathbf{x}_M\} \subseteq \mathcal{X}^n$, where $M = 2^{nR}$, $R$ being
the coding rate in bits per channel use. The marginal probability distribution function of each
codeword $\mathbf{x}_i$ is denoted by $Q(\mathbf{x}_i)$. It will be assumed that the various codewords
are pairwise independent.^1 Let $P(\mathbf{y}|\mathbf{x})$ be the conditional probability distribution
of the channel output vector $\mathbf{y} \in \mathcal{Y}^n$ given the channel input vector
$\mathbf{x} \in \mathcal{X}^n$. We make no assumptions at all concerning the channel.^2 We will assume,
throughout most of this paper, that both the channel input alphabet $\mathcal{X}$ and the channel
output alphabet $\mathcal{Y}$ are finite sets. Finally, we define a class of decoding metrics as
a class of real functions,
$\mathcal{M} = \{m_\theta(\mathbf{x},\mathbf{y}),\ \theta \in \Theta,\ \mathbf{x} \in \mathcal{X}^n,\ \mathbf{y} \in \mathcal{Y}^n\}$,
where $\Theta$ is an index set, which may be either finite, countably infinite, or uncountably
infinite.^3 The decoder associated with the decoding metric $m_\theta$, which will be denoted by
$\mathcal{D}_\theta$, decides in favor of the message $i \in \{1,\ldots,M\}$ which maximizes
$m_\theta(\mathbf{x}_i,\mathbf{y})$ for the given received channel output vector $\mathbf{y}$, that is

$$\mathcal{D}_\theta:\quad \hat{i} = \arg\max_{1 \le i \le M} m_\theta(\mathbf{x}_i,\mathbf{y}). \tag{1}$$

The message $i$ is assumed to be uniformly distributed in the set $\{1,2,\ldots,M\}$. It should be
emphasized that the optimum, ML decoding metric for the underlying channel $P(\mathbf{y}|\mathbf{x})$
may not necessarily belong to the given class of decoding metrics $\mathcal{M}$. In other words, this
is a problem of universal decoding with possible mismatch.

The average error probability $P_{e,\theta}(R,n)$, associated with the decoder $\mathcal{D}_\theta$, is defined as

$$P_{e,\theta}(R,n) = \frac{1}{M}\sum_{i=1}^{M} \Pr\left\{\bigcup_{j \ne i}\left\{m_\theta(\mathbf{X}_j,\mathbf{Y}) \ge m_\theta(\mathbf{X}_i,\mathbf{Y})\right\}\ \Big|\ \mathbf{X}_i\ \text{sent}\right\}, \tag{2}$$

where $\Pr\{\cdot\}$ designates the probability measure pertaining to the randomness of the codebook
$\mathcal{C}$ as well as that of the channel output given its input.

^1 Full independence of all codewords is allowed, but not enforced. This permits our setting to include, among other
things, ensembles of linear codes, which are well known to admit pairwise independence, but not stronger notions of
independence.

^2 We even allow a deterministic channel, which puts all its probabilistic mass on one vector $\mathbf{y}$ which is given by a
deterministic function of $\mathbf{x}$.

^3 For example, in the uncountably infinite case, $\theta$ may designate a parameter and
$\{m_\theta(\mathbf{x},\mathbf{y}),\ \theta \in \Theta\}$ may be a smooth parametric family.
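The decision rule of eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codebook, the received vector and the agreement-counting metric are hypothetical choices, and ties are broken in favor of the smallest index, whereas the error probability analysis in this paper counts ties as errors.

```python
def metric_decode(codebook, y, metric):
    """Return the index i that maximizes metric(codebook[i], y), as in eq. (1).

    Ties are resolved in favor of the smallest index (an arbitrary choice
    made for this sketch only).
    """
    return max(range(len(codebook)), key=lambda i: metric(codebook[i], y))


def agreement_metric(x, y):
    """Hypothetical additive metric: number of positions where x and y agree."""
    return sum(1 for a, b in zip(x, y) if a == b)


codebook = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 1, 0, 1)]
y = (1, 1, 1, 0)
print(metric_decode(codebook, y, agreement_metric))  # 1
```

Any other member of a class of metrics can be passed as the `metric` argument without changing the decoder itself.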
While the decoder $\mathcal{D}_\theta$ that minimizes $P_{e,\theta}(R,n)$ within the class depends, in
general, on the unknown underlying channel $P(\mathbf{y}|\mathbf{x})$, our goal is to devise a universal
decoder $\mathcal{D}_u$, with a decoding metric $U(\mathbf{x},\mathbf{y})$, independent of the
underlying channel $P(\mathbf{y}|\mathbf{x})$, whose average error probability would be essentially as
small as $\min_\theta P_{e,\theta}(R,n)$, whatever the underlying channel may be. By
"essentially as small", we mean that the average error probability associated with the universal
decoder,

$$P_{e,u}(R,n) = \frac{1}{M}\sum_{i=1}^{M} \Pr\left\{\bigcup_{j \ne i}\left\{U(\mathbf{X}_j,\mathbf{Y}) \ge U(\mathbf{X}_i,\mathbf{Y})\right\}\ \Big|\ \mathbf{X}_i\ \text{sent}\right\}, \tag{3}$$

would not exceed $\min_\theta P_{e,\theta}(R,n)$ by more than a multiplicative factor that grows
sub-exponentially with $n$. This means that whenever $\min_\theta P_{e,\theta}(R,n)$ decays
exponentially with $n$, then so does $P_{e,u}(R,n)$, and at an exponential rate at least as fast.
Another (essentially equivalent) legitimate goal is that $P_{e,u}(R,n)$ would not be larger than
$\min_\theta P_{e,\theta}(R+\Delta_n,n)$, where $\Delta_n \to 0$ as $n \to \infty$. In
the next section, we shall see that both goals are met by a conceptually simple universal decoding
metric $U(\mathbf{x},\mathbf{y})$, which depends solely on $Q$ and on the reference class
$\mathcal{M}$ of competing decoding metrics.

3 Main Result 

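Consider the given random coding distribution $Q$ and the given family of decoding metrics
$\mathcal{M} = \{m_\theta(\mathbf{x},\mathbf{y}),\ \theta \in \Theta\}$, as defined earlier. Let us define

$$T(\mathbf{x}|\mathbf{y}) \triangleq \left\{\mathbf{x}':\ \forall\,\theta \in \Theta\ \ m_\theta(\mathbf{x}',\mathbf{y}) = m_\theta(\mathbf{x},\mathbf{y})\right\}. \tag{4}$$

Our universal decoding metric is defined as

$$U(\mathbf{x},\mathbf{y}) \triangleq -\frac{1}{n}\log Q[T(\mathbf{x}|\mathbf{y})]. \tag{5}$$

Note that when $\mathcal{X}$ is a discrete alphabet, $\{T(\mathbf{x}|\mathbf{y})\}$ are equivalence
classes for every $\mathbf{y} \in \mathcal{Y}^n$, and so the space $\mathcal{X}^n$ can be partitioned
into a disjoint union of them. Let $K_n(\mathbf{y})$ denote the number of equivalence classes
$\{T(\mathbf{x}|\mathbf{y})\}$ for a given $\mathbf{y}$. Also define

$$K_n \triangleq \max_{\mathbf{y} \in \mathcal{Y}^n} K_n(\mathbf{y}) \tag{6}$$

and

$$\Delta_n \triangleq \frac{\log K_n}{n}. \tag{7}$$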
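For very short blocks, the definitions in eqs. (4)-(7) can be evaluated by exhaustive enumeration. The following Python sketch does so for a toy setting; the two metrics, the uniform $Q$ and all function names are hypothetical choices made for illustration only, and the enumeration is of course exponential in $n$.

```python
from collections import defaultdict
from itertools import product
from math import log2


def equivalence_classes(metrics, y, alphabet):
    """Partition alphabet^n into the classes T(x|y) of eq. (4).

    Two sequences share a class when every metric in `metrics` assigns
    them the same value for the given y.
    """
    classes = defaultdict(list)
    for x in product(alphabet, repeat=len(y)):
        classes[tuple(m(x, y) for m in metrics)].append(x)
    return classes


def universal_metric(x, y, metrics, Q, alphabet):
    """U(x,y) = -(1/n) log2 Q[T(x|y)], eq. (5), by exhaustive enumeration."""
    classes = equivalence_classes(metrics, y, alphabet)
    members = classes[tuple(m(x, y) for m in metrics)]
    return -log2(sum(Q(xp) for xp in members)) / len(y)


def delta_n(metrics, n, x_alphabet, y_alphabet):
    """Delta_n = log2(K_n)/n with K_n = max over y of the number of classes."""
    k_n = max(len(equivalence_classes(metrics, y, x_alphabet))
              for y in product(y_alphabet, repeat=n))
    return log2(k_n) / n


# Toy setting: binary alphabets, n = 4, uniform Q, and a class of two
# metrics (number of agreements with y, and Hamming weight of x).
metrics = [
    lambda x, y: sum(a == b for a, b in zip(x, y)),
    lambda x, y: sum(x),
]
uniform_Q = lambda x: 2.0 ** (-len(x))

y = (0, 1, 1, 0)
print(len(equivalence_classes(metrics, y, (0, 1))))                    # 9
print(universal_metric((0, 1, 1, 0), y, metrics, uniform_Q, (0, 1)))   # 1.0
print(delta_n(metrics, 4, (0, 1), (0, 1)))
```

In this example the sequence $\mathbf{x} = \mathbf{y}$ is alone in its class, so $Q[T(\mathbf{x}|\mathbf{y})] = 2^{-4}$ and $U(\mathbf{x},\mathbf{y}) = 1$.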
Our main result is the following theorem:

Theorem 1 Under the assumptions of Section 2, the universal decoding metric defined in eq. (5)
satisfies:

$$P_{e,u}(R,n) \le 2 \cdot 2^{n\Delta_n} \cdot \min_{\theta \in \Theta} P_{e,\theta}(R,n) \tag{8}$$

and

$$P_{e,u}(R,n) \le 2 \cdot \min_{\theta \in \Theta} P_{e,\theta}(R+\Delta_n,n). \tag{9}$$

Discussion. The theorem is, of course, meaningful when $\Delta_n \to 0$ as $n \to \infty$, which means that
the number of various equivalence classes $\{T(\mathbf{x}|\mathbf{y})\}$ grows sub-exponentially as a
function of $n$, uniformly in $\mathbf{y}$. As mentioned earlier, in this case, whenever
$\min_{\theta \in \Theta} P_{e,\theta}(R,n)$ decays exponentially with $n$, then $P_{e,u}(R,n)$ decays
exponentially as well, and at least as fast. Consequently, the maximum information rate pertaining to
the universal decoder is at least as large as that of the best decoder in the given class. We
therefore learn from Theorem 1 that a sufficient condition for the existence of a universal decoder
is $\lim_{n\to\infty} \Delta_n = 0$. Whether this is also a necessary condition remains an open
question at this point. Necessary and sufficient conditions for universality in the
ordinary setting have been furnished in [5] and [6].

Intuitively, the behavior of $\Delta_n$ for large $n$ is a measure of the richness of the class of
decoding metrics. The larger the index set $\Theta$, the smaller the equivalence classes
$\{T(\mathbf{x}|\mathbf{y})\}$, and then their total number $K_n(\mathbf{y})$ becomes larger, and so
does $\Delta_n$. Universality is enabled, using this method, as long as the set $\Theta$ is not too
rich, so that $\Delta_n$ still vanishes as $n$ grows without bound.
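To give a feeling for how fast $\Delta_n$ can vanish, consider additive metrics of the form $\sum_i \theta(x_i,y_i)$ (the class treated in Example 1 below), for which the equivalence classes are conditional type classes. The Python sketch below relies on the standard type-counting bound, namely that the number of joint types is at most $(n+1)^{|\mathcal{X}||\mathcal{Y}|}$, which gives $\Delta_n \le |\mathcal{X}||\mathcal{Y}|\log(n+1)/n$. The function name is hypothetical and the bound is not claimed to be tight.

```python
from math import log2


def delta_upper_bound(n, x_size, y_size):
    """Upper bound on Delta_n for additive metrics via the type-counting bound.

    K_n <= (n + 1) ** (x_size * y_size), hence
    Delta_n <= x_size * y_size * log2(n + 1) / n.
    """
    return x_size * y_size * log2(n + 1) / n


for n in (10, 100, 1000, 10000):
    print(n, round(delta_upper_bound(n, 2, 2), 4))
```

For binary alphabets the bound drops below 0.01 bits at $n = 10^4$, which illustrates the $(\log n)/n$ behavior.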

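When $Q$ is invariant within $T(\mathbf{x}|\mathbf{y})$ (i.e., $\mathbf{x}' \in T(\mathbf{x}|\mathbf{y})$
implies $Q(\mathbf{x}') = Q(\mathbf{x})$), we have

\begin{align}
U(\mathbf{x},\mathbf{y}) &= -\frac{1}{n}\log Q[T(\mathbf{x}|\mathbf{y})] \nonumber\\
&= -\frac{1}{n}\log\left[Q(\mathbf{x}) \cdot |T(\mathbf{x}|\mathbf{y})|\right] \nonumber\\
&= -\frac{1}{n}\left[\log Q(\mathbf{x}) + \log|T(\mathbf{x}|\mathbf{y})|\right]. \tag{10}
\end{align}

The choice of a distribution $Q$ that is invariant within $T(\mathbf{x}|\mathbf{y})$ is convenient,
because in most cases it is easier to evaluate the log-cardinality of $T(\mathbf{x}|\mathbf{y})$ (or
its log-volume, in the continuous case) than to evaluate its probability under a general probability
measure $Q$.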
Before we turn to the proof of Theorem 1, it would be instructive to consider two simple
examples. In both of them (as well as in other examples in the sequel) $Q$ is invariant within
$T(\mathbf{x}|\mathbf{y})$.

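Example 1. Let $Q$ be the uniform distribution across a single type class, $T_{\mathbf{x}}$, and let
$\mathcal{M}$ be the class of additive decoding metrics

$$m_\theta(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{n} \theta(x_i,y_i), \tag{11}$$

where $\{\theta(x,y),\ x \in \mathcal{X},\ y \in \mathcal{Y}\}$ are arbitrary real-valued matrices. In
this case, $T(\mathbf{x}|\mathbf{y}) = T_{\mathbf{x}|\mathbf{y}}$, the conditional type class of
$\mathbf{x}$ given $\mathbf{y}$. Since the number of distinct conditional type classes is
polynomial in $n$, then $\Delta_n$ is proportional to $(\log n)/n$. In this case, we have

\begin{align}
U(\mathbf{x},\mathbf{y}) &= -\frac{1}{n}\log Q[T_{\mathbf{x}|\mathbf{y}}] \tag{12}\\
&= -\frac{1}{n}\log\left[Q(\mathbf{x}) \cdot |T_{\mathbf{x}|\mathbf{y}}|\right] \tag{13}\\
&= \hat{H}_{\mathbf{x}}(X) - \hat{H}_{\mathbf{x}\mathbf{y}}(X|Y) + o(n) \tag{14}\\
&= \hat{I}_{\mathbf{x}\mathbf{y}}(X;Y) + o(n), \tag{15}
\end{align}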

and so, the proposed universal decoder essentially^4 coincides with the MMI decoder. However,
since $\mathcal{C}$ is a constant composition code under this particular choice of $Q$,
$\hat{H}_{\mathbf{x}_i}(X)$ is the same for all $i$, and so, this decoder is equivalent to the decoder
that selects the codeword that minimizes the empirical conditional entropy of $X$ given $Y$, namely,
$\min_i \hat{H}_{\mathbf{x}_i\mathbf{y}}(X|Y)$. If, on the other hand, $Q$ is
an i.i.d. probability distribution function, namely, $Q(\mathbf{x}) = \prod_{i=1}^{n} Q(x_i)$, then the
universal decoding metric becomes

$$U(\mathbf{x},\mathbf{y}) = \hat{I}_{\mathbf{x}\mathbf{y}}(X;Y) + D(P_{\mathbf{x}}\|Q) + o(n), \tag{16}$$

where $D(P_{\mathbf{x}}\|Q)$ is the Kullback-Leibler divergence between $P_{\mathbf{x}}$ and $Q$.
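The passage from eq. (13) to eq. (15) can be checked numerically. The Python sketch below is an illustration under stated assumptions and uses hypothetical function names. It takes $Q$ uniform over the type class of $\mathbf{x}$, so that $Q[T_{\mathbf{x}|\mathbf{y}}] = |T_{\mathbf{x}|\mathbf{y}}|/|T_{\mathbf{x}}|$, computes both cardinalities exactly from multinomial coefficients, and compares the resulting metric with the empirical mutual information.

```python
from collections import Counter
from math import lgamma, log, log2


def log2_multinomial(counts):
    """log2 of (sum of counts)! / prod(count!), computed via log-gamma."""
    n = sum(counts)
    return (lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)


def exact_metric(x, y):
    """U(x,y) = (1/n) [log2 |T_x| - log2 |T_{x|y}|] for Q uniform over T_x."""
    joint = Counter(zip(x, y))
    log_tx = log2_multinomial(list(Counter(x).values()))
    # |T_{x|y}| is a product, over the letters b of y, of multinomial
    # coefficients built from the joint counts n(a, b).
    log_txy = sum(
        log2_multinomial([c for (_, b2), c in joint.items() if b2 == b])
        for b in set(y)
    )
    return (log_tx - log_txy) / len(x)


def entropy(seq):
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def empirical_mi(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# n = 400, with 10% of the positions disagreeing between x and y.
x = [0] * 200 + [1] * 200
y = [0] * 180 + [1] * 20 + [0] * 20 + [1] * 180
print(exact_metric(x, y), empirical_mi(x, y))
```

For this block length the two numbers differ only in the third decimal place, in line with the vanishing correction term in eqs. (14) and (15).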

For certain classes of channels (e.g., arbitrarily varying channels), it is not difficult to derive
single-letter formulas for the maximum achievable information rates in the random coding regime,
that is, the supremum of $R$ such that $P_{e,u}(R,n) \to 0$ as $n \to \infty$. The main tool for this
purpose is the method of types. This concludes Example 1. □

^4 The $o(n)$ term can be omitted without affecting the asymptotic performance.
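Example 2. Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}$ and let

$$Q(\mathbf{x}) = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2}. \tag{17}$$

Let $\theta = (\theta_1,\theta_2) \in \mathbb{R}^2$ and $\mathcal{M}$ be the class of decoding metrics of the form^5

$$m_\theta(\mathbf{x},\mathbf{y}) = \theta_1\sum_{i=1}^{n} x_i y_i + \theta_2\sum_{i=1}^{n} x_i^2. \tag{18}$$

In principle, $T(\mathbf{x}|\mathbf{y})$ is the set of all $\{\mathbf{x}'\}$ with the same empirical
power and the same empirical correlation with $\mathbf{y}$ as those of $\mathbf{x}$. However, since in
this example the sequences $\mathbf{x}$ and $\mathbf{y}$ have continuous-valued components, some
tolerance must be allowed in the empirical correlation
$C(\mathbf{x},\mathbf{y}) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i$ and the empirical power
$S(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i^2$ for $T(\mathbf{x}|\mathbf{y})$ to have positive
probability (and positive volume), and so, $T(\mathbf{x}|\mathbf{y})$ should be redefined as the set of
sequences $\mathbf{x}'$ for which $C(\mathbf{x}',\mathbf{y})$ and $S(\mathbf{x}')$ are within
$\epsilon$ ($\epsilon > 0$, but small) of $C(\mathbf{x},\mathbf{y})$ and $S(\mathbf{x})$, respectively.
Using the methods developed in [13],^6 it is not difficult to show that, after omitting some additive
constants (which do not affect the decision rule), we have in this case

$$U(\mathbf{x},\mathbf{y}) = \frac{S(\mathbf{x})}{2\sigma^2} - \frac{1}{2}\ln\left[S(\mathbf{x})\left(1-\rho_{\mathbf{x}\mathbf{y}}^2\right)\right], \tag{19}$$

where $\rho_{\mathbf{x}\mathbf{y}} = C(\mathbf{x},\mathbf{y})/\sqrt{S(\mathbf{x})S(\mathbf{y})}$ is the
empirical correlation coefficient between $\mathbf{x}$ and $\mathbf{y}$, and
where we have used natural logarithms instead of base 2 logarithms for obvious reasons. The first
term stems from $-\frac{1}{n}\ln Q(\mathbf{x})$ and the second term comes from the negative log-volume
of $T(\mathbf{x}|\mathbf{y})$. This concludes Example 2. □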
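A numerical sketch of the metric of Example 2 is given below in Python. It rests on assumptions that should be read as such: $Q$ is taken to be i.i.d. zero-mean Gaussian with variance `sigma2` (a parameter name introduced here), so that $-\frac{1}{n}\ln Q(\mathbf{x})$ contributes $S(\mathbf{x})/(2\sigma^2)$ up to an additive constant, and the log-volume term is $-\frac{1}{2}\ln[S(\mathbf{x})(1-\rho_{\mathbf{x}\mathbf{y}}^2)]$. Additive constants are dropped, and the degenerate case $\rho_{\mathbf{x}\mathbf{y}}^2 = 1$ is not handled.

```python
from math import log


def gaussian_universal_metric(x, y, sigma2=1.0):
    """Sketch of U(x,y) for the Gaussian example, natural logarithms.

    Assumes Q is i.i.d. N(0, sigma2). Raises ValueError (from log) when
    x and y are exactly proportional, since then 1 - rho^2 = 0.
    """
    n = len(x)
    s_x = sum(a * a for a in x) / n                  # empirical power of x
    s_y = sum(b * b for b in y) / n                  # empirical power of y
    c_xy = sum(a * b for a, b in zip(x, y)) / n      # empirical correlation
    rho2 = c_xy ** 2 / (s_x * s_y)                   # squared correlation coefficient
    return s_x / (2.0 * sigma2) - 0.5 * log(s_x * (1.0 - rho2))


y = (0.9, -1.1, 1.2, -0.8)
x1 = (1.0, -1.0, 1.0, -1.0)   # strongly correlated with y
x2 = (1.0, 1.0, -1.0, -1.0)   # weakly correlated with y
print(gaussian_universal_metric(x1, y))
print(gaussian_universal_metric(x2, y))
```

Both candidate codewords have the same empirical power, so the comparison is decided by the correlation term alone, and the more strongly correlated codeword receives the larger metric.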

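Proof of Theorem 1. The pairwise average error probability associated with $m_\theta$ is lower bounded
by

\begin{align}
\Pi_{e,\theta}(\mathbf{x},\mathbf{y}) &\triangleq \sum_{\{\mathbf{x}':\ m_\theta(\mathbf{x}',\mathbf{y}) \ge m_\theta(\mathbf{x},\mathbf{y})\}} Q(\mathbf{x}') \tag{20}\\
&\ge \sum_{\mathbf{x}' \in T(\mathbf{x}|\mathbf{y})} Q(\mathbf{x}') \tag{21}\\
&= Q[T(\mathbf{x}|\mathbf{y})] \tag{22}\\
&= \exp[-nU(\mathbf{x},\mathbf{y})]. \tag{23}
\end{align}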

^5 This class of decoders is clearly motivated by the family of channels $y_t = a x_t + z_t$, where $a$ is an unknown
parameter and $z_t$ is an i.i.d. Gaussian process, independent of $x_t$.

^6 The details are conceptually simple but technically tedious. The interested reader is referred to [13] for a rigorous
treatment.
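On the other hand, the pairwise error probability associated with the decoding metric $U$ is upper
bounded by

\begin{align}
\Pi_{e,u}(\mathbf{x},\mathbf{y}) &\triangleq \sum_{\{\mathbf{x}':\ U(\mathbf{x}',\mathbf{y}) \ge U(\mathbf{x},\mathbf{y})\}} Q(\mathbf{x}') \tag{24}\\
&= \sum_{\{T(\mathbf{x}'|\mathbf{y}):\ U(\mathbf{x}',\mathbf{y}) \ge U(\mathbf{x},\mathbf{y})\}}\ \sum_{\tilde{\mathbf{x}} \in T(\mathbf{x}'|\mathbf{y})} Q(\tilde{\mathbf{x}}) \tag{25}\\
&= \sum_{\{T(\mathbf{x}'|\mathbf{y}):\ U(\mathbf{x}',\mathbf{y}) \ge U(\mathbf{x},\mathbf{y})\}} Q[T(\mathbf{x}'|\mathbf{y})] \tag{26}\\
&= \sum_{\{T(\mathbf{x}'|\mathbf{y}):\ U(\mathbf{x}',\mathbf{y}) \ge U(\mathbf{x},\mathbf{y})\}} \exp[-nU(\mathbf{x}',\mathbf{y})] \tag{27}\\
&\le \sum_{\{T(\mathbf{x}'|\mathbf{y}):\ U(\mathbf{x}',\mathbf{y}) \ge U(\mathbf{x},\mathbf{y})\}} \exp[-nU(\mathbf{x},\mathbf{y})] \tag{28}\\
&\le \sum_{\{T(\mathbf{x}'|\mathbf{y})\}} \exp[-nU(\mathbf{x},\mathbf{y})] \tag{29}\\
&\le 2^{n\Delta_n}\exp[-nU(\mathbf{x},\mathbf{y})] \tag{30}\\
&= \exp\{-n[U(\mathbf{x},\mathbf{y}) - \Delta_n]\}, \tag{31}
\end{align}

where in the second equality we have used the fact that $U(\mathbf{x},\mathbf{y})$ depends on
$\mathbf{x}$ and $\mathbf{y}$ only via $T(\mathbf{x}|\mathbf{y})$, and the last inequality follows from
the fact that the number of different equivalence classes $\{T(\mathbf{x}'|\mathbf{y})\}$ is upper
bounded by $K_n = 2^{n\Delta_n}$ by definition. Now, as is well known, given $\mathbf{x}$
and $\mathbf{y}$, the average probability of error can be upper bounded in terms of the average
pairwise error probability by the expectation of the union bound, clipped to unity, that is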

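$$P_{e,u}(R,n) \le E\left[\min\left\{1,\ 2^{nR}\,\Pi_{e,u}(\mathbf{X},\mathbf{Y})\right\}\right] \le E\left[\min\left\{1,\ 2^{nR}\exp(-n[U(\mathbf{X},\mathbf{Y}) - \Delta_n])\right\}\right], \tag{32}$$

where the expectation is w.r.t. the randomness of $\mathbf{X}$ and $\mathbf{Y}$, whose joint
distribution is given by $Q(\mathbf{x})P(\mathbf{y}|\mathbf{x})$.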

Next, we need a lower bound on $P_{e,\theta}(R,n)$ in terms of $\Pi_{e,\theta}(\mathbf{x},\mathbf{y})$.
To this end, we invoke the following lower bound on the probability of the union of pairwise
independent events $A_1,\ldots,A_M$, proved by Shulman [19, p. 109, Lemma A.2]:^7

$$\Pr\left\{\bigcup_{i=1}^{M} A_i\right\} \ge \frac{1}{2} \cdot \min\left\{1,\ \sum_{i=1}^{M} \Pr(A_i)\right\}. \tag{33}$$

^7 A similar result was proved independently in [20, Lemma 1] for fully independent events with equal probabilities.
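The lower bound of eq. (33) can be verified exactly on a small family of pairwise independent events. The construction in the Python sketch below is a textbook one and is not taken from [19]: the pair $(a,b)$ is uniform over $\mathbb{Z}_p \times \mathbb{Z}_p$ for a prime $p$, and $A_t = \{a + tb \equiv 0 \bmod p\}$. For $t \ne t'$ the map $(a,b) \mapsto (a+tb,\ a+t'b)$ is a bijection, so any two events are independent, each with probability $1/p$. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def union_and_lower_bound(p, num_events):
    """Exact Pr(union of A_t) and the right-hand side of eq. (33).

    Requires p prime and num_events <= p so that the events
    A_t = {a + t*b = 0 mod p}, t = 0..num_events-1, are pairwise independent.
    """
    outcomes = list(product(range(p), repeat=2))
    in_union = sum(
        1 for a, b in outcomes
        if any((a + t * b) % p == 0 for t in range(num_events))
    )
    pr_union = Fraction(in_union, len(outcomes))
    sum_pr = Fraction(num_events, p)
    lower = Fraction(1, 2) * min(Fraction(1), sum_pr)
    return pr_union, lower


print(union_and_lower_bound(7, 3))  # (Fraction(19, 49), Fraction(3, 14))
print(union_and_lower_bound(7, 7))  # (Fraction(43, 49), Fraction(1, 2))
```

In both cases the exact probability of the union lies between the lower bound of eq. (33) and the ordinary union bound, clipped to unity.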






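In our case, for a given $\mathbf{X}_i = \mathbf{x}$ and $\mathbf{y}$, the events
$\{m_\theta(\mathbf{X}_j,\mathbf{y}) \ge m_\theta(\mathbf{x},\mathbf{y})\}_{j \ne i}$ are pairwise
independent since we have assumed that the various codewords are pairwise independent. Thus, after
taking the expectation w.r.t. the joint distribution of $(\mathbf{X},\mathbf{Y})$, we have

$$P_{e,\theta}(R,n) \ge \frac{1}{2} \cdot E\left[\min\left\{1,\ 2^{nR}\,\Pi_{e,\theta}(\mathbf{X},\mathbf{Y})\right\}\right] \ge \frac{1}{2} \cdot E\left[\min\left\{1,\ 2^{nR}\exp(-nU(\mathbf{X},\mathbf{Y}))\right\}\right]. \tag{34}$$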

Comparing now the right-most side of eq. (32) with that of eq. (34), we readily see that
$P_{e,u}(R,n)$ is upper bounded both by $2P_{e,\theta}(R+\Delta_n,n)$ and by
$2 \cdot 2^{n\Delta_n}P_{e,\theta}(R,n)$. The first upper bound is obtained by combining $\Delta_n$
and $R$ in (32) and the second upper bound is obtained similarly, by upper bounding the unity term
(in $\min\{1,\ 2^{n[R+\Delta_n]}\Pi_{e,\theta}(\mathbf{X},\mathbf{Y})\}$) by $2^{n\Delta_n}$, which then
becomes a constant multiplicative factor of the upper bound. Since both inequalities hold for every
$\theta$, whereas $P_{e,u}(R,n)$ is independent of $\theta$, we have actually proved the inequalities

$$P_{e,u}(R,n) \le 2 \cdot \min_{\theta \in \Theta} P_{e,\theta}(R+\Delta_n,n) \tag{35}$$

and

$$P_{e,u}(R,n) \le 2 \cdot 2^{n\Delta_n} \cdot \min_{\theta \in \Theta} P_{e,\theta}(R,n). \tag{36}$$

This completes the proof of Theorem 1. □
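The two pairwise bounds at the heart of the proof, namely $\Pi_{e,\theta}(\mathbf{x},\mathbf{y}) \ge Q[T(\mathbf{x}|\mathbf{y})]$ from eqs. (20)-(23) and $\Pi_{e,u}(\mathbf{x},\mathbf{y}) \le K_n(\mathbf{y}) \cdot Q[T(\mathbf{x}|\mathbf{y})]$ from eqs. (24)-(30), can be confirmed by brute force on a toy instance. In the Python sketch below, the block length, the i.i.d. Bernoulli $Q$ and the two metrics are hypothetical choices. Exact rational arithmetic is used, and the comparison $U(\mathbf{x}',\mathbf{y}) \ge U(\mathbf{x},\mathbf{y})$ is carried out in the equivalent form $Q[T(\mathbf{x}'|\mathbf{y})] \le Q[T(\mathbf{x}|\mathbf{y})]$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

N = 4
ALPHABET = (0, 1)
P1 = Fraction(3, 10)  # hypothetical Bernoulli parameter of the i.i.d. Q


def Q(x):
    """Probability of x under the i.i.d. Bernoulli(P1) distribution."""
    ones = sum(x)
    return P1 ** ones * (1 - P1) ** (len(x) - ones)


METRICS = [
    lambda x, y: sum(a == b for a, b in zip(x, y)),  # number of agreements
    lambda x, y: sum(a * b for a, b in zip(x, y)),   # correlation
]


def check_pairwise_bounds():
    """Return True if both pairwise bounds hold for every pair (x, y)."""
    sequences = list(product(ALPHABET, repeat=N))
    for y in sequences:
        signature = {x: tuple(m(x, y) for m in METRICS) for x in sequences}
        class_prob = defaultdict(Fraction)
        for x in sequences:
            class_prob[signature[x]] += Q(x)
        k_n_y = len(class_prob)                      # K_n(y)
        for x in sequences:
            q_t = class_prob[signature[x]]           # Q[T(x|y)]
            for m in METRICS:                        # lower bound, every metric
                pi_theta = sum(Q(xp) for xp in sequences if m(xp, y) >= m(x, y))
                if pi_theta < q_t:
                    return False
            # Upper bound: sum over x' whose class is no more probable.
            pi_u = sum(Q(xp) for xp in sequences
                       if class_prob[signature[xp]] <= q_t)
            if pi_u > k_n_y * q_t:
                return False
    return True


print(check_pairwise_bounds())  # True
```

The check does not depend on any channel, which reflects the fact that the pairwise bounds in the proof involve only $Q$ and the class of metrics.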

One of the elegant points in [21] is that the universality of the proposed decoding metric, in the 
random coding error exponent sense, is proved using a comparative analysis, without recourse to an 
explicit derivation of the random coding error exponent of the optimum decoder. The above proof 
of Theorem 1 has the same feature. However, thanks to Shulman's lower bound on the probability 
of a union of events, the proof here is both simpler and more general than in [21], in several 
respects: (i) it allows a general random coding distribution Q, not just the uniform distribution, 

(ii) it requires only pairwise independence and not full independence between the codewords, and 

(iii) it assumes nothing concerning the underlying channel. Indeed, it will be seen shortly how
Ziv's universal decoding metric is obtained as a special case of our approach.

We summarize a few important points: 

1. We have defined a fairly general framework for universal decoding, allowing a general random
coding distribution $Q$, a general channel, and a general family of decoding metrics
$\{m_\theta,\ \theta \in \Theta\}$. Most of the previous works in universal decoding, mentioned in the
second and the third paragraphs of the Introduction, relate to the special case where the ML decoder
for the given channel is equivalent to $m_\theta$ for a certain choice of $\theta$.

2. Another special case that falls within our framework is mismatched decoding: In this case, $\Theta$
is a singleton and the corresponding unique decoding metric $m_\theta$ is different from the ML
decoding metric of the actual channel.

3. Yet another special case is the case where the channel is deterministic. This is partially 
related to the "individual channel" paradigm due to Lomnitz and Feder (see, e.g., [11], [12] 
among many other papers), Misra and Weissman [16], and Shayevitz and Feder [18]. The 
main difference is that here, we are not concerned with universality of the encoder, as we 
simply assume a fixed random coding distribution. In the absence of feedback, there is no 
hope for universal encoding. 

4 Useful Approximations of the Universal Decoding Metric 

In some situations, it may not be a trivial task to evaluate $Q[T(\mathbf{x}|\mathbf{y})]$, which is
needed in order to implement the proposed universal decoding metric. Suppose, however, that one can
uniformly lower bound $Q[T(\mathbf{x}|\mathbf{y})] = \exp\{-nU(\mathbf{x},\mathbf{y})\}$ by
$\exp\{-nU'(\mathbf{x},\mathbf{y})\}$, for some function $U'(\mathbf{x},\mathbf{y})$ which
is computable, and suppose that $U'(\cdot,\cdot)$ is not too large in the sense that it satisfies the
following condition:

$$\max_{\mathbf{y} \in \mathcal{Y}^n} \sum_{\mathbf{x} \in \mathcal{X}^n} Q(\mathbf{x})\,2^{nU'(\mathbf{x},\mathbf{y})} \le 2^{n\Delta'_n}, \tag{37}$$

where $\Delta'_n \to 0$ as $n \to \infty$. We argue that in such a case, $U'(\cdot,\cdot)$ can replace
$U(\cdot,\cdot)$ as a universal decoding metric and Theorem 1 remains valid.
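As a consistency check, sketched here under the assumption that the sum is restricted to sequences with $Q(\mathbf{x}) > 0$, the exact metric $U$ itself satisfies condition (37) with $\Delta'_n = \Delta_n$. Grouping the sequences by equivalence class gives

```latex
\begin{align*}
\sum_{\mathbf{x}} Q(\mathbf{x})\,2^{nU(\mathbf{x},\mathbf{y})}
  &= \sum_{\{T(\mathbf{x}'|\mathbf{y})\}}\ \sum_{\mathbf{x} \in T(\mathbf{x}'|\mathbf{y})}
     \frac{Q(\mathbf{x})}{Q[T(\mathbf{x}'|\mathbf{y})]} \\
  &= K_n(\mathbf{y}) \\
  &\le K_n = 2^{n\Delta_n},
\end{align*}
```

since each class contributes exactly one to the sum. Condition (37) can therefore be read as requiring that $U'$ behaves like a metric induced by a sub-exponential number of classes.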

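To see why this is true, first observe that $\Pi_{e,\theta}(\mathbf{x},\mathbf{y})$ is trivially lower
bounded by $\exp\{-nU'(\mathbf{x},\mathbf{y})\}$, following (34) and the very definition of
$U'(\mathbf{x},\mathbf{y})$ as an upper bound on $U(\mathbf{x},\mathbf{y})$. As for the upper
bound, we have

\begin{align}
\Pi_{e,u'}(\mathbf{x},\mathbf{y}) &\triangleq \sum_{\{\mathbf{x}':\ U'(\mathbf{x}',\mathbf{y}) \ge U'(\mathbf{x},\mathbf{y})\}} Q(\mathbf{x}') \tag{38}\\
&= \exp[-nU'(\mathbf{x},\mathbf{y})] \cdot \sum_{\{\mathbf{x}':\ U'(\mathbf{x}',\mathbf{y}) \ge U'(\mathbf{x},\mathbf{y})\}} Q(\mathbf{x}')\exp[nU'(\mathbf{x},\mathbf{y})] \tag{39}\\
&\le \exp[-nU'(\mathbf{x},\mathbf{y})] \cdot \sum_{\{\mathbf{x}':\ U'(\mathbf{x}',\mathbf{y}) \ge U'(\mathbf{x},\mathbf{y})\}} Q(\mathbf{x}')\exp[nU'(\mathbf{x}',\mathbf{y})] \tag{40}\\
&\le \exp[-nU'(\mathbf{x},\mathbf{y})] \cdot \sum_{\mathbf{x}' \in \mathcal{X}^n} Q(\mathbf{x}')\exp[nU'(\mathbf{x}',\mathbf{y})] \tag{41}\\
&\le \exp\{-n[U'(\mathbf{x},\mathbf{y}) - \Delta'_n]\}. \tag{42}
\end{align}

Now, the corresponding upper bounds on $P_{e,u'}(R,n)$, in terms of $\min_\theta P_{e,\theta}(R,n)$,
are derived as before, just with $U$ replaced by $U'$. The price of passing from $U$ to $U'$ might be
in a slowdown of the convergence of $\Delta'_n$ vs. $\Delta_n$. For example, $U'$ might correspond to
more refined equivalence classes $\{T(\mathbf{x}|\mathbf{y})\}$.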

As an example of the usefulness of this result, let us refer to Ziv's universal decoding metric for
finite-state channels [21]. In particular, let $\mathcal{M}$ be the class of decoding metrics
corresponding to finite-state channels, defined as follows: For a given
$\mathbf{x} \in \mathcal{X}^n$ and $\mathbf{y} \in \mathcal{Y}^n$, let
$\mathbf{s} = (s_1,\ldots,s_n) \in \mathcal{S}^n$ ($\mathcal{S}$ being a finite set) be a sequence
generated recursively according to

$$s_{i+1} = g(x_i,y_i,s_i), \quad i = 1,\ldots,n-1, \tag{43}$$

where $s_1$ is some fixed initial state and
$g: \mathcal{X} \times \mathcal{Y} \times \mathcal{S} \to \mathcal{S}$ is a certain next-state
function. Now define

$$m_\theta(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{n} \theta(x_i,y_i,s_i), \tag{44}$$

where $\{\theta(x,y,s),\ x \in \mathcal{X},\ y \in \mathcal{Y},\ s \in \mathcal{S}\}$ are arbitrary
real-valued parameters. Similarly as in [21], suppose that $Q(\mathbf{x})$ is the uniform distribution
over $\mathcal{X}^n$. Then $Q[T(\mathbf{x}|\mathbf{y})]$ is proportional to
$|T(\mathbf{x}|\mathbf{y})|$, but the problem is that here, unlike in Example 1, there is no apparent
single-letter expression^8 for the exponential growth rate of $|T(\mathbf{x}|\mathbf{y})|$ in general
(unless the state variable in eq. (43) depends solely on the previous state and the previous channel
output). Moreover, $|T(\mathbf{x}|\mathbf{y})|$ depends on the next-state function $g$ in eq. (43),
which is assumed unknown. Fortunately enough, however, $|T(\mathbf{x}|\mathbf{y})|$, in this case, can
be lower bounded [21, Lemma 1] by

$$|T(\mathbf{x}|\mathbf{y})| \ge 2^{LZ(\mathbf{x}|\mathbf{y}) - n\,o(n)}, \tag{45}$$

where $LZ(\mathbf{x}|\mathbf{y})$ denotes the length (in bits) of the conditional Lempel-Ziv code (see
[21, proof of Lemma 2], [14]) of $\mathbf{x}$ when $\mathbf{y}$ is given as side information at both
encoder and decoder. Consequently, one can upper bound $U(\mathbf{x},\mathbf{y})$ by

$$U'(\mathbf{x},\mathbf{y}) = \log|\mathcal{X}| - \frac{LZ(\mathbf{x}|\mathbf{y})}{n} + o(n) \tag{46}$$

and use it as our decoding metric. Indeed, eq. (37) is satisfied by this choice of $U'$ since

\begin{align}
\sum_{\mathbf{x}} Q(\mathbf{x})\,2^{nU'(\mathbf{x},\mathbf{y})} &= \sum_{\mathbf{x}} \frac{2^{nU'(\mathbf{x},\mathbf{y})}}{|\mathcal{X}|^n} \tag{47}\\
&= \sum_{\mathbf{x}} 2^{-LZ(\mathbf{x}|\mathbf{y}) + n\,o(n)} \tag{48}\\
&\le 2^{n\,o(n)}, \tag{49}
\end{align}

where the last inequality is Kraft's inequality, which holds since $LZ(\mathbf{x}|\mathbf{y})$ is a
length function of $\mathbf{x}$ for every $\mathbf{y}$. This explains why Ziv's decoder, which selects
the message $i$ with the minimum of $LZ(\mathbf{x}_i|\mathbf{y})$, is universally asymptotically
optimum in the random coding exponent sense. Note that the assumption that $Q$ is uniform is not
really essential here. In fact, $Q$ can also be any exchangeable probability distribution (i.e.,
$\mathbf{x}'$ is a permutation of $\mathbf{x}$ implies $Q(\mathbf{x}') = Q(\mathbf{x})$). Moreover, if
the state variable $s_i$ includes a component, say, $\sigma_i$, that is fed merely by $\{x_i\}$ (but
not $\{y_i\}$), then it is enough that $Q$ would be invariant within conditional types of
$\mathbf{x}$ given $\boldsymbol{\sigma} = (\sigma_1,\ldots,\sigma_n)$. In such a case, we would have

$$U'(\mathbf{x},\mathbf{y}) = -\frac{1}{n}\left[\log Q(\mathbf{x}) + LZ(\mathbf{x}|\mathbf{y})\right]. \tag{50}$$

^8 In a nutshell, had there been such a single-letter expression, one could have easily derived a single-letter expression
for the entropy rate of a hidden Markov process [1, Section 4.5] using the method of types.
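To make the role of $LZ(\mathbf{x}|\mathbf{y})$ more tangible, the following Python sketch computes a simplified stand-in for it and uses it for decoding. It is an assumption-laden illustration rather than the exact code length of [21]: the pair sequence is parsed incrementally in the LZ78 manner, the distinct joint phrases are grouped by their $\mathbf{y}$-part, and the quantity $\sum_l c_l \log c_l$ is returned, where $c_l$ is the number of distinct joint phrases sharing the $l$-th $\mathbf{y}$-phrase. Lower-order terms of the true code length are omitted and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def joint_lz_phrases(x, y):
    """LZ78-style incremental parsing of the pair sequence (x_1,y_1),...,(x_n,y_n)."""
    seen, phrases, current = set(), [], ()
    for pair in zip(x, y):
        current += (pair,)
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ()
    if current:  # trailing phrase that repeats an earlier one
        phrases.append(current)
    return phrases


def conditional_lz_complexity(x, y):
    """Simplified proxy for LZ(x|y): sum over y-phrases of c_l * log2(c_l)."""
    distinct = set(joint_lz_phrases(x, y))
    per_y_phrase = Counter(tuple(b for _, b in phrase) for phrase in distinct)
    return sum(c * log2(c) for c in per_y_phrase.values())


def lz_decode(codebook, y):
    """Pick the codeword index with the smallest conditional LZ proxy."""
    return min(range(len(codebook)),
               key=lambda i: conditional_lz_complexity(codebook[i], y))


y = [0] * 16
x_other = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1]
codebook = [x_other, list(y)]
print(conditional_lz_complexity(x_other, y))  # 12.0
print(conditional_lz_complexity(y, y))        # 0.0
print(lz_decode(codebook, y))                 # 1
```

A codeword that is fully determined by $\mathbf{y}$ yields one joint phrase per $\mathbf{y}$-phrase and hence zero complexity, which is why it is preferred by the decoder in this toy example.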

5 Extensions 

We now demonstrate how our method extends to more involved scenarios of communication sys- 
tems. The first extension corresponds to random coding distributions that allow access to noiseless 
feedback. While this extension is not complicated, it is important from the operational point of 

view, because feedback allows the encoder to learn the channel and thereby to adapt the random 
coding distribution to the channel statistical characteristics. 

Our second extension is to the problem of universal decoding for multiple access channels 
(MAC's) with respect to a given class of decoding metrics (again, without feedback, but the exten- 
sion that combines feedback is again straightforward). This extension is deliberately not provided 
in full generality in the sense that we make a certain facilitating assumption on the structure of 
the class of decoding metrics, in order to make the analysis simpler. The main point here is not 
the quest for full generality, but to demonstrate that this extension, even under this facilitating
assumption, is not a trivial task since the universal decoding metric has to confront three different
types of error events (in the case of a MAC with two senders): (i) the event where both messages are
decoded incorrectly, (ii) the event where only the message of sender no. 1 is decoded incorrectly, and 
(iii) the event where only the message of sender no. 2 is decoded incorrectly. As a consequence, it 
turns out that the resulting universal decoding metric is surprisingly different from those of earlier 
works on universal decoding for the MAC [10], [5, Section VIII], [17], mostly because the problem 
setting here is different (and more general) from those of these earlier works (in the sense that the 
universality here is relative to a given class of decoders while the underlying channel is arbitrary, 
and not relative to a given class of channels). While we are not arguing that all the universal 
decoders of these previous articles are necessarily suboptimum in our scenario, we are able to prove 
the universality only for our own universal decoding metric. 

5.1 Feedback 

In the paradigm of random coding in the presence of feedback, it is convenient to think of an 
independent random selection of symbols of $\mathcal{X}$ along a tree whose branches are labeled by 

$$\{y_1\},\ \{y_1,y_2\},\ \ldots,\ \{y_1,\ldots,y_{n-1}\},$$

for all possible outcomes of these vectors. Accordingly, the random coding distribution $Q(x)$ is 
replaced by 

$$Q(x|y) = \prod_{i=1}^{n} Q(x_i|x^{i-1},y^{i-1}). \tag{51}$$

Thus, each message $i \in \{1,2,\ldots,M\}$ is represented by a complete tree of depth $n$ with $|\mathcal{Y}|^{n-1}$ 
leaves. Theorem 1 and its proof remain intact with $Q(\cdot)$ replaced by $Q(\cdot|y)$ in all places. 
Thus, the universal decoding metric is redefined as 

$$U(x,y) = -\frac{1}{n}\log Q[T(x|y)|y], \tag{52}$$

the expectation in eqs. (32) and (34) is redefined w.r.t. 

$$P(x,y) = \prod_{i=1}^{n}\left[Q(x_i|x^{i-1},y^{i-1})P(y_i|x^{i},y^{i-1})\right], \tag{53}$$

and in condition (37), $Q(x)$ is replaced by $Q(x|y)$. 

One might limit the structure of the feedback, for example, by letting each $Q(\cdot|x^{i-1},y^{i-1})$ 
depend on $(x^{i-1},y^{i-1})$ only via a state variable $t_i$ fed by these two sequences, i.e., 

$$t_i = g(t_{i-1},x_{i-1},y_{i-1}), \tag{54}$$

that is, 

$$Q(x|y) = \prod_{i=1}^{n} Q(x_i|x^{i-1},y^{i-1}) = \prod_{i=1}^{n} Q(x_i|t_i). \tag{55}$$

In the above example of decoding metrics corresponding to finite-state channels, one can refine 
the equivalence classes to include the information about $t = (t_1,\ldots,t_n)$ (see Section 4), and then 
$Q$ would be invariant within each refined type class. In this case, the decoding metric $U'$ 
would become 

$$U'(x,y) = -\frac{1}{n}\left[\log Q(x|y) + LZ(x|y)\right], \tag{56}$$

where $Q(x|y)$ is understood to be defined according to eq. (55). 
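As an illustration of the state-dependent feedback distribution of eqs. (54) and (55), the following Python sketch evaluates $\log_2 Q(x|y)$ symbol by symbol. The particular next-state function and the conditional distribution are hypothetical choices made only for this example (binary alphabets, a one-bit state that records whether the previous output differed from the previous input); they are not taken from the text.

```python
import math
from itertools import product

def next_state(t_prev, x_prev, y_prev):
    # Hypothetical g(t_{i-1}, x_{i-1}, y_{i-1}): 1 if the last symbol was flipped.
    return x_prev ^ y_prev

def q_symbol(x_i, t_i):
    # Hypothetical Q(x_i | t_i): uniform in state 0, biased toward 1 in state 1.
    p_one = 0.5 if t_i == 0 else 0.75
    return p_one if x_i == 1 else 1.0 - p_one

def log2_q_feedback(x, y, t0=0):
    """log2 Q(x|y) = sum_i log2 Q(x_i | t_i), with t_i updated as in eq. (54)."""
    t, total = t0, 0.0
    for x_i, y_i in zip(x, y):
        total += math.log2(q_symbol(x_i, t))
        t = next_state(t, x_i, y_i)  # state seen by the next symbol
    return total

print(log2_q_feedback((0, 1, 1), (0, 0, 1)))
```

Because each factor conditions only on past inputs and outputs, $Q(\cdot|y)$ sums to one over all input sequences for every fixed $y$, which is what allows it to take the place of $Q(\cdot)$ in the analysis.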

5.2 The Multiple Access Channel 

Consider an arbitrary multiple access channel (MAC), namely, a channel with two inputs, $x_1$ and 
$x_2$, and one output $y$. The two inputs are used by two different users who do not cooperate. 
User no. 1 generates $M_1 = 2^{nR_1}$ independent codewords, $x_1(1),\ldots,x_1(M_1)$, using a random coding 
distribution $Q_1$, and user no. 2 generates $M_2 = 2^{nR_2}$ independent codewords, $x_2(1),\ldots,x_2(M_2)$, 
using a random coding distribution $Q_2$.† 

†We should point out that a more general model definition should allow time-sharing, which means that the 
codewords of both users should be drawn conditionally independently given a sequence $s$ that designates the 
time-sharing protocol known to all parties. This will just amount to conditioning many quantities on $s$. For the 
sake of simplicity of the exposition, we will not add this conditioning on $s$. 

We define a class $\mathcal{M}$ of decoding metrics $\{m_\theta(x_1,x_2,y),\ \theta \in \Theta\}$. Decoder $\mathcal{D}_\theta$ picks the pair 
of messages $(x_1(i),x_2(j))$, $i \in \{1,\ldots,M_1\}$, $j \in \{1,\ldots,M_2\}$, which maximizes $m_\theta(x_1(i),x_2(j),y)$. 
We assume that the random coding ensemble and the class of decoders are such that for every 
$y$, $m_\theta(X_1(i),X_2(j),y)$ and $m_\theta(X_1(i'),X_2(j'),y)$ are statistically independent whenever $(i,j) \ne (i',j')$. 
While this requirement is easily satisfied when both $i \ne i'$ and $j \ne j'$ (for example, when 
all codewords are drawn by independent random selection), it is less obvious for combinations of 
pairs $(i,j)$ and $(i',j')$ for which either $i = i'$ or $j = j'$ (but, of course, not both). Still, this 
requirement is satisfied, for example, if $\mathcal{X}_1 = \mathcal{X}_2 = \{0,1,\ldots,K-1\}$ (or the continuous interval 
$[0,A)$), $Q_1$ and $Q_2$ are both uniform across the alphabet, and $m_\theta(x_1,x_2,y)$ depends on $x_1$ and $x_2$ 
only via $x_1 \oplus x_2$, where $\oplus$ denotes addition modulo $K$ (or addition modulo $A$, in the example of the 
continuous case). Decoding metrics with this property are motivated by classes of multiple access 
channels, $P(y|x_1,x_2)$, in which the users interfere with each other additively, i.e., $P(y|x_1,x_2) = 
W(y|x_1 \oplus x_2)$. Still, the dependence of $y$ on $x_1 \oplus x_2$ can be arbitrary. In other words, the channel 
is known to depend only on the modular sum of the inputs, but the form of this dependence 
may not be known. Another example where the above independence requirement is met is when 
$\mathcal{X}_1 = \mathcal{X}_2 = \{-1,+1\}$ and $m_\theta$ depends on $x_1$ and $x_2$ only via their component-wise product $x_1 \cdot x_2$. 
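The independence requirement for pairs that share one codeword can be checked by exact enumeration in the modulo-additive example. The Python sketch below is only an illustration under assumed parameters (single-letter inputs and alphabet size $K = 3$, both hypothetical): it computes the joint law of $X_1 \oplus X_2$ and $X_1' \oplus X_2$ for independent uniform $X_1$, $X_1'$, $X_2$ and confirms that it is the product of two uniform laws. Any metric that depends on the inputs only through the modular sum then inherits this independence.

```python
from fractions import Fraction
from itertools import product

def joint_law_of_sums(K):
    """Exact joint law of ((X1 + X2) mod K, (X1' + X2) mod K) for independent
    uniform X1, X1', X2 over {0, ..., K-1}; the two pairs share the codeword X2."""
    law = {}
    p = Fraction(1, K ** 3)
    for x1, x1_alt, x2 in product(range(K), repeat=3):
        key = ((x1 + x2) % K, (x1_alt + x2) % K)
        law[key] = law.get(key, Fraction(0)) + p
    return law

def sums_are_independent(K):
    # Independence with uniform marginals means every cell has mass 1/K^2.
    law = joint_law_of_sums(K)
    return all(law.get((s, s_alt), Fraction(0)) == Fraction(1, K ** 2)
               for s in range(K) for s_alt in range(K))

print(sums_are_independent(3))
```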

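We now define three kinds of equivalence classes: 

\begin{align*}
T(x_1,x_2|y) &= \{(x_1',x_2'):\ \forall\theta\in\Theta\ \ m_\theta(x_1',x_2',y) = m_\theta(x_1,x_2,y)\} \tag{57}\\
T(x_1|x_2,y) &= \{x_1':\ \forall\theta\in\Theta\ \ m_\theta(x_1',x_2,y) = m_\theta(x_1,x_2,y)\}\\
&= \{x_1':\ (x_1',x_2)\in T(x_1,x_2|y)\} \tag{58}\\
T(x_2|x_1,y) &= \{x_2':\ \forall\theta\in\Theta\ \ m_\theta(x_1,x_2',y) = m_\theta(x_1,x_2,y)\}\\
&= \{x_2':\ (x_1,x_2')\in T(x_1,x_2|y)\}. \tag{59}
\end{align*}

We also assume, as before, that for every $y$, the number of different type classes $\{T(x_1,x_2|y)\}$ is 
upper bounded by $2^{n\Delta_n}$. Next, define the following functions: 

\begin{align*}
U_0(x_1,x_2,y) &= -\frac{1}{n}\log\{(Q_1\times Q_2)[T(x_1,x_2|y)]\} \tag{60}\\
U_1(x_1,x_2,y) &= -\frac{1}{n}\log Q_1[T(x_1|x_2,y)] \tag{61}\\
U_2(x_1,x_2,y) &= -\frac{1}{n}\log Q_2[T(x_2|x_1,y)]. \tag{62}
\end{align*}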

What makes the MAC interesting, in the context of universal decoding, is that the universal decoder 
has to cope with three different types of errors: (i) both messages are decoded incorrectly, (ii) the 
message of user no. 2 is decoded correctly, but that of user no. 1 is not, and (iii) like (ii), but with 
the roles of the users swapped. From Theorem 1 and its proof (after an obvious modification), it 
is apparent that had only errors of type (i) existed, then $U_0$ could have been a universal decoding 
metric. Similarly, had only errors of type (ii) existed, then $U_1$ could have been a universal decoding metric, 
and by the same token, for errors of type (iii) alone, one would use $U_2$. However, in reality, all three 
types of error events might occur, and we need one universal decoding metric that handles all of 
them at the same time. The question is then how to combine $U_0$, $U_1$ and $U_2$ into one metric that 
would work at least as well as the best decoder in the given class. 

The answer turns out to be the following: Define the universal decoding metric as 

$$U(x_1,x_2,y) = \min\{[U_0(x_1,x_2,y) - R_1 - R_2],\ [U_1(x_1,x_2,y) - R_1],\ [U_2(x_1,x_2,y) - R_2]\}. \tag{63}$$

We argue that $U(x_1,x_2,y)$ competes favorably with the best $m_\theta$ in a sense analogous to that 
asserted in Theorem 1. This decoding metric is different from the universal decoding metrics used 
for the MAC, for example, in [17] and [10], which were based on the MMI decoder and the minimum 
empirical conditional entropy (minimum equivocation) rule, respectively. It is not argued here that 
these decoding rules are necessarily suboptimal in the present setting, but on the other hand, we do 
not have a proof that they compete favorably with the best decoder in the class $\mathcal{M}$. The remaining 
part of this section is devoted to a description of the main modifications and extensions needed in 
the proof of Theorem 1 in order to prove the universality of $U(x_1,x_2,y)$ for the MAC. 
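To make the definitions in eqs. (57)-(63) concrete, the following Python sketch evaluates the type-class probabilities and the resulting metric $U$ by brute force on a toy instance. Everything specific here is an assumption made for illustration and does not come from the text: block length $n = 2$, binary alphabets, uniform $Q_1$ and $Q_2$, and a hypothetical two-element class of metrics $m_\theta(x_1,x_2,y) = -\theta\, d_H(x_1 \oplus x_2, y)$, where $d_H$ is the Hamming distance.

```python
from itertools import product
from math import log2

n = 2                                     # toy block length (assumption)
SEQS = list(product((0, 1), repeat=n))    # all binary n-vectors
THETAS = (1.0, 2.0)                       # hypothetical finite index set

def metric(theta, x1, x2, y):
    # Hypothetical m_theta: depends on (x1, x2) only via x1 XOR x2.
    return -theta * sum((a ^ b) != c for a, b, c in zip(x1, x2, y))

def same_type(pair_a, pair_b, y):
    # Two input pairs are equivalent if every metric in the class agrees on them.
    return all(metric(t, *pair_a, y) == metric(t, *pair_b, y) for t in THETAS)

def universal_metric(x1, x2, y, R1, R2):
    q = 1.0 / len(SEQS)                   # uniform Q1 = Q2
    ref = (x1, x2)
    p0 = sum(q * q for a in SEQS for b in SEQS if same_type((a, b), ref, y))
    p1 = sum(q for a in SEQS if same_type((a, x2), ref, y))
    p2 = sum(q for b in SEQS if same_type((x1, b), ref, y))
    U0, U1, U2 = (-log2(p) / n for p in (p0, p1, p2))   # eqs. (60)-(62)
    return min(U0 - R1 - R2, U1 - R1, U2 - R2)          # eq. (63)

print(universal_metric((0, 1), (0, 1), (0, 0), 0.25, 0.25))
```

In this toy class, an input pair whose modulo-2 sum matches $y$ exactly lies in a small type class and therefore receives a larger value of $U$ than a pair at Hamming distance 1, whose type class is twice as probable.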

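The pairwise probability of type (i) error for an arbitrary decoder in the reference class $\mathcal{M}$ is 
lower bounded by 

\begin{align*}
P_{e,\theta}^{(0)}(x_1,x_2,y) &= \sum_{\{(x_1',x_2'):\ m_\theta(x_1',x_2',y)\ge m_\theta(x_1,x_2,y)\}} Q_1(x_1')Q_2(x_2') \tag{64}\\
&\ge \sum_{(x_1',x_2')\in T(x_1,x_2|y)} Q_1(x_1')Q_2(x_2') \tag{65}\\
&= (Q_1\times Q_2)[T(x_1,x_2|y)] \tag{66}\\
&= 2^{-nU_0(x_1,x_2,y)}. \tag{67}
\end{align*}

As for the pairwise error probability of type (ii), we have 

\begin{align*}
P_{e,\theta}^{(1)}(x_1,x_2,y) &= \sum_{\{x_1':\ m_\theta(x_1',x_2,y)\ge m_\theta(x_1,x_2,y)\}} Q_1(x_1') \tag{68}\\
&\ge \sum_{x_1'\in T(x_1|x_2,y)} Q_1(x_1') \tag{69}\\
&= Q_1[T(x_1|x_2,y)] \tag{70}\\
&= 2^{-nU_1(x_1,x_2,y)}, \tag{71}
\end{align*}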

and similarly, for type (iii): 

$$P_{e,\theta}^{(2)}(x_1,x_2,y) \ge 2^{-nU_2(x_1,x_2,y)}. \tag{72}$$

Let $\mathcal{I}$ be half of the set $\{1,2,\ldots,M_1\} - \{i\}$ and let $\mathcal{J}$ be half of the set $\{1,2,\ldots,M_2\} - \{j\}$ ($i$ and 
$j$ being the correct messages of the two senders). Let $\mathcal{A}$ be the set of all $(M_1-1)(M_2-1)/4$ pairs 
$(i',j') \in \mathcal{I}^c \times \mathcal{J}^c$, where $\mathcal{I}^c$ and $\mathcal{J}^c$ denote the complementary halves, so that both $i' \ne i$ and $j' \ne j$. 
Under our above assumptions, the following is true: given $(x_1(i),x_2(j),y)$, the events 

$$\{m_\theta(X_1(i'),x_2(j),y) \ge m_\theta(x_1(i),x_2(j),y)\}_{i'\in\mathcal{I}},$$

$$\{m_\theta(x_1(i),X_2(j'),y) \ge m_\theta(x_1(i),x_2(j),y)\}_{j'\in\mathcal{J}},$$

and 

$$\{m_\theta(X_1(i'),X_2(j'),y) \ge m_\theta(x_1(i),x_2(j),y)\}_{(i',j')\in\mathcal{A}}$$

are all pairwise independent. Defining the set of pairs $\mathcal{B} = \mathcal{A} \cup [\{i\}\times\mathcal{J}] \cup [\mathcal{I}\times\{j\}]$, the total 
probability of error, associated with the decoder $\mathcal{D}_\theta$, is lower bounded as follows: 

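\begin{align*}
&P_{e,\theta}(R_1,R_2,n) \tag{73}\\
&= \frac{1}{M_1M_2}\sum_{i=1}^{M_1}\sum_{j=1}^{M_2}\Pr\bigcup_{(i',j')\ne(i,j)}\left\{m_\theta(X_1(i'),X_2(j'),Y)\ge m_\theta(X_1(i),X_2(j),Y)\right\} \tag{74}\\
&\ge \frac{1}{M_1M_2}\sum_{i=1}^{M_1}\sum_{j=1}^{M_2}\Pr\bigcup_{(i',j')\in\mathcal{B}}\left\{m_\theta(X_1(i'),X_2(j'),Y)\ge m_\theta(X_1(i),X_2(j),Y)\right\} \tag{75}\\
&\ge \frac{1}{2}E\min\left\{1,\ \frac{(M_1-1)(M_2-1)}{4}\cdot 2^{-nU_0(X_1,X_2,Y)} + \frac{M_1-1}{2}\cdot 2^{-nU_1(X_1,X_2,Y)} + \frac{M_2-1}{2}\cdot 2^{-nU_2(X_1,X_2,Y)}\right\} \tag{76}\\
&\doteq E\min\left\{1,\ 2^{-n[U_0(X_1,X_2,Y)-R_1-R_2]} + 2^{-n[U_1(X_1,X_2,Y)-R_1]} + 2^{-n[U_2(X_1,X_2,Y)-R_2]}\right\} \tag{77}\\
&\doteq E\min\left\{1,\ 2^{-nU(X_1,X_2,Y)}\right\} \tag{78}\\
&= E\left\{2^{-n[U(X_1,X_2,Y)]_+}\right\}, \tag{79}
\end{align*}

where $\doteq$ denotes equality to within a sub-exponential factor, and the second inequality is again due 
to Shulman [19, Lemma A.2]. Consider now the function $U(x_1,x_2,y)$ as a universal decoding metric. 
Then, we have the following: 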

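\begin{align*}
P_{e,u}^{(0)}(x_1,x_2,y) &= \sum_{\{(x_1',x_2'):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}} Q_1(x_1')Q_2(x_2') \tag{80}\\
&= \sum_{\{T(x_1',x_2'|y):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}}\ \sum_{(\tilde{x}_1,\tilde{x}_2)\in T(x_1',x_2'|y)} Q_1(\tilde{x}_1)Q_2(\tilde{x}_2) \tag{81}\\
&= \sum_{\{T(x_1',x_2'|y):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}} (Q_1\times Q_2)[T(x_1',x_2'|y)]\\
&\le \sum_{\{T(x_1',x_2'|y):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}} 2^{-nU_0(x_1',x_2',y)}.
\end{align*}

Similarly, 

\begin{align*}
P_{e,u}^{(1)}(x_1,x_2,y) &= \sum_{\{x_1':\ U(x_1',x_2,y)\ge U(x_1,x_2,y)\}} Q_1(x_1')\\
&= \sum_{\{T(x_1'|x_2,y):\ U(x_1',x_2,y)\ge U(x_1,x_2,y)\}}\ \sum_{\tilde{x}_1\in T(x_1'|x_2,y)} Q_1(\tilde{x}_1)\\
&\le \sum_{\{T(x_1',x_2'|y):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}}\ \sum_{\tilde{x}_1\in T(x_1'|x_2',y)} Q_1(\tilde{x}_1)\\
&= \sum_{\{T(x_1',x_2'|y):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}} Q_1[T(x_1'|x_2',y)]\\
&\le \sum_{\{T(x_1',x_2'|y):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}} 2^{-nU_1(x_1',x_2',y)},
\end{align*}

and by the same token, 

$$P_{e,u}^{(2)}(x_1,x_2,y) \le \sum_{\{T(x_1',x_2'|y):\ U(x_1',x_2',y)\ge U(x_1,x_2,y)\}} 2^{-nU_2(x_1',x_2',y)}.$$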


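The overall error probability of the decoder based on $U$ is then upper bounded as follows: 

\begin{align*}
P_{e,u}(R_1,R_2,n) &= \frac{1}{M_1M_2}\sum_{i=1}^{M_1}\sum_{j=1}^{M_2}\Pr\left\{\bigcup_{(i',j')\ne(i,j)}\left\{U(X_1(i'),X_2(j'),Y)\ge U(X_1(i),X_2(j),Y)\right\}\ \Big|\ (i,j)\ \mbox{sent}\right\}\\
&\le E\min\left\{1,\ 2^{n(R_1+R_2)}P_{e,u}^{(0)}(X_1,X_2,Y) + 2^{nR_1}P_{e,u}^{(1)}(X_1,X_2,Y) + 2^{nR_2}P_{e,u}^{(2)}(X_1,X_2,Y)\right\}\\
&\le E\min\left\{1,\ \sum_{\{T(x_1',x_2'|Y):\ U(x_1',x_2',Y)\ge U(X_1,X_2,Y)\}}\left[2^{-n[U_0(x_1',x_2',Y)-R_1-R_2]} + 2^{-n[U_1(x_1',x_2',Y)-R_1]} + 2^{-n[U_2(x_1',x_2',Y)-R_2]}\right]\right\}\\
&\doteq E\min\left\{1,\ \sum_{\{T(x_1',x_2'|Y):\ U(x_1',x_2',Y)\ge U(X_1,X_2,Y)\}} 2^{-nU(x_1',x_2',Y)}\right\}\\
&\le E\min\left\{1,\ \sum_{\{T(x_1',x_2'|Y):\ U(x_1',x_2',Y)\ge U(X_1,X_2,Y)\}} 2^{-nU(X_1,X_2,Y)}\right\}\\
&\le E\min\left\{1,\ \sum_{\{T(x_1',x_2'|Y)\}} 2^{-nU(X_1,X_2,Y)}\right\} \tag{97}\\
&\le E\min\left\{1,\ 2^{-n[U(X_1,X_2,Y)-\Delta_n]}\right\},
\end{align*}

where the step marked $\doteq$ holds to within a factor of 3. The last expression is of the same 
exponential order as the lower bound on $P_{e,\theta}(R_1,R_2,n)$, and hence $P_{e,u}(R_1,R_2,n)$ is 
exponentially at least as small as $\min_{\theta\in\Theta}P_{e,\theta}(R_1,R_2,n)$. 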
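Both the lower bound, in the passage from (77) to (79), and the upper bound above replace a sum of three exponentials by the single exponential governed by the minimum in eq. (63). The following Python sketch is a numerical sanity check of why this costs at most a factor of 3; the function names and the sampled parameter ranges are hypothetical and serve only this illustration.

```python
import random

def sum_form(n, a0, a1, a2):
    # min{1, 2^{-n a0} + 2^{-n a1} + 2^{-n a2}}, as inside the expectation in (77)
    return min(1.0, 2.0 ** (-n * a0) + 2.0 ** (-n * a1) + 2.0 ** (-n * a2))

def min_form(n, a0, a1, a2):
    # 2^{-n [min(a0, a1, a2)]_+}, as inside the expectation in (79)
    return 2.0 ** (-n * max(0.0, min(a0, a1, a2)))

def within_factor_three(n, a0, a1, a2):
    lo = min_form(n, a0, a1, a2)
    return lo <= sum_form(n, a0, a1, a2) <= 3.0 * lo

rng = random.Random(0)
checks = [within_factor_three(rng.randint(1, 50),
                              rng.uniform(-1.0, 3.0),
                              rng.uniform(-1.0, 3.0),
                              rng.uniform(-1.0, 3.0))
          for _ in range(1000)]
print(all(checks))
```

Here the three arguments play the roles of $U_0 - R_1 - R_2$, $U_1 - R_1$ and $U_2 - R_2$, so their minimum is $U$; since the factor 3 does not grow with $n$, it does not affect the error exponent.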

Similarly as in Section 4, suppose that $U_0$, $U_1$ and $U_2$ can be uniformly upper bounded by $U_0'$, 
$U_1'$ and $U_2'$, respectively, and assume that: 

\begin{align*}
\max_{y}\sum_{x_1,x_2} Q_1(x_1)Q_2(x_2)2^{nU_0'(x_1,x_2,y)} &\le 1 \tag{100}\\
\max_{x_2,y}\sum_{x_1} Q_1(x_1)2^{nU_1'(x_1,x_2,y)} &\le 1 \tag{101}\\
\max_{x_1,y}\sum_{x_2} Q_2(x_2)2^{nU_2'(x_1,x_2,y)} &\le 1. \tag{102}
\end{align*}

Then, $U_0'$, $U_1'$ and $U_2'$ can replace $U_0$, $U_1$ and $U_2$, respectively, in the universal decoding metric, 
denoted in turn by $U'$, and the upper and lower bounds continue to hold with $U'$ replacing $U$. The 
lower bounds on $P_{e,\theta}^{(0)}(x_1,x_2,y)$, $P_{e,\theta}^{(1)}(x_1,x_2,y)$, and $P_{e,\theta}^{(2)}(x_1,x_2,y)$, in terms of $U_0'$, $U_1'$ and $U_2'$, 
respectively, are trivial, of course. As for the upper bounds on $P_{e,u'}^{(0)}(x_1,x_2,y)$, $P_{e,u'}^{(1)}(x_1,x_2,y)$, and 
$P_{e,u'}^{(2)}(x_1,x_2,y)$, we proceed similarly as follows: 

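\begin{align*}
P_{e,u'}^{(0)}(x_1,x_2,y) &= \sum_{\{(x_1',x_2'):\ U'(x_1',x_2',y)\ge U'(x_1,x_2,y)\}} Q_1(x_1')Q_2(x_2') \tag{103}\\
&= \sum_{\{T(x_1',x_2'|y):\ U'(x_1',x_2',y)\ge U'(x_1,x_2,y)\}}\ \sum_{(\tilde{x}_1,\tilde{x}_2)\in T(x_1',x_2'|y)} Q_1(\tilde{x}_1)Q_2(\tilde{x}_2) \tag{104}\\
&= \sum_{\{T(x_1',x_2'|y):\ U'(x_1',x_2',y)\ge U'(x_1,x_2,y)\}} 2^{-nU_0'(x_1',x_2',y)}\times\sum_{(\tilde{x}_1,\tilde{x}_2)\in T(x_1',x_2'|y)} Q_1(\tilde{x}_1)Q_2(\tilde{x}_2)2^{nU_0'(x_1',x_2',y)} \tag{105}\\
&= \sum_{\{T(x_1',x_2'|y):\ U'(x_1',x_2',y)\ge U'(x_1,x_2,y)\}} 2^{-nU_0'(x_1',x_2',y)}\times\sum_{(\tilde{x}_1,\tilde{x}_2)\in T(x_1',x_2'|y)} Q_1(\tilde{x}_1)Q_2(\tilde{x}_2)2^{nU_0'(\tilde{x}_1,\tilde{x}_2,y)} \tag{106}\\
&\le \sum_{\{T(x_1',x_2'|y):\ U'(x_1',x_2',y)\ge U'(x_1,x_2,y)\}} 2^{-nU_0'(x_1',x_2',y)}, \tag{107}
\end{align*}

and similar treatments hold for $P_{e,u'}^{(1)}(x_1,x_2,y)$ and $P_{e,u'}^{(2)}(x_1,x_2,y)$. This suggests that, in the 
case where the class $\mathcal{M}$ is based on finite-state machines, 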

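$$m_\theta(x_1,x_2,y) = \sum_{i=1}^{n} d(x_{1,i},x_{2,i},y_i,s_i),\qquad s_{i+1} = g(x_{1,i},x_{2,i},y_i,s_i), \tag{108}$$

and $Q_1$ and $Q_2$ are uniform distributions within single type classes, one may use $LZ(x_1,x_2|y)$, 
$LZ(x_1|x_2,y)$ and $LZ(x_2|x_1,y)$ in the relevant places, i.e., the universal decoding metric would be 

$$U'(x_1,x_2,y) = \min\left\{-\frac{1}{n}\left[\log(Q_1(x_1)Q_2(x_2)) + LZ(x_1,x_2|y)\right] - R_1 - R_2,\ -\frac{1}{n}\left[\log Q_1(x_1) + LZ(x_1|x_2,y)\right] - R_1,\ -\frac{1}{n}\left[\log Q_2(x_2) + LZ(x_2|x_1,y)\right] - R_2\right\}. \tag{109}$$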



Thus, we observe that our approach suggests a systematic method to extend earlier results to more 
involved scenarios, like that of the MAC. 